[OV] Fix data-free VLM compression via optimum-cli #1058
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This fix relates to openvinotoolkit/openvino.genai#1348
@nikita-savelyevv, thanks for the PR. Please make sure that the tests you added don't increase the overall validation time dramatically. If they do, please use smaller models instead, e.g. a dummy decoder instead of opt-125m.
The *export* testing time has indeed increased by 4 minutes with this PR (31 min now). But overall OV testing time is still limited by the *diffusion* tests, which take 33 min. I think we should address this in the near future, but it can be done in a separate PR.
@nikita-savelyevv I tested the model exported with optimum-cli built from this branch using the code from the GenAI README: https://github.com/openvinotoolkit/openvino.genai?tab=readme-ov-file#run-generation-using-vlmpipeline-api-in-python. I exported with just … I tested on Xeon, and also tried with the f32 INFERENCE_PRECISION_HINT, which did not make a difference.
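(For reference, one way to force f32 inference precision is sketched below. This is an editor's sketch using optimum-intel's `ov_config`, not the commenter's exact GenAI setup; the model path is a placeholder.)

```python
# Sketch only: forcing f32 inference precision through optimum-intel's ov_config.
# The commenter tested via OpenVINO GenAI; this shows an equivalent knob on the
# optimum-intel side. The model path below is a placeholder.
from optimum.intel import OVModelForVisualCausalLM

model = OVModelForVisualCausalLM.from_pretrained(
    "MiniCPM-V-2_6",  # placeholder: path to the exported model directory
    trust_remote_code=True,
    ov_config={"INFERENCE_PRECISION_HINT": "f32"},
)
```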
This is interesting. When running inference via optimum-intel I don't get an empty response. But when running inference via OpenVINO GenAI, I do. My code:

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from optimum.intel import OVModelForVisualCausalLM

model_path = "/home/nsavel/workspace/optimum-intel/MiniCPM-V-2_6"
image_file = "dog.jpg"
prompt = "Can you describe the image?"

# optimum-intel inference
raw_image = Image.open(image_file)
model = OVModelForVisualCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
inputs = model.preprocess_inputs(text=prompt, image=raw_image, processor=processor, tokenizer=tokenizer)
generation_kwargs = dict(max_new_tokens=100, do_sample=False)
output = model.generate(**inputs, **generation_kwargs)
print("optimum-intel:", processor.decode(output[0], skip_special_tokens=True))

# openvino.genai inference
pipe = ov_genai.VLMPipeline(model_path, "CPU")
image = Image.open(image_file)
image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8)
image_data = ov.Tensor(image_data)
print("\nopenvino.genai:", pipe.generate(prompt, image=image_data, max_new_tokens=100))
```

Output:
When the model is compressed with …
This is also the case when compressing with …
Thanks for the sample code @nikita-savelyevv, I re-exported the model with group size 16 with your PR and observe the same as you did. OpenVINO GenAI inference works fine with group size 16, but not without it, both with and without your PR. Tested on Xeon with nightly/dev versions of openvino-genai and nncf.
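(To make the two cases discussed above concrete, here is an editor's sketch of the corresponding compression configurations expressed through optimum-intel's Python API. The exact CLI flags used by the commenters are not captured in this thread, so the config values below are assumptions; either config would be passed as `quantization_config` to `OVModelForVisualCausalLM.from_pretrained`, as in the example in the PR description below.)

```python
# Sketch only: reconstruction of the two int4 compression setups discussed above.
from optimum.intel import OVWeightQuantizationConfig

# Default int4 compression (no explicit group size) -- the case where
# OpenVINO GenAI reportedly returns an empty generation result.
default_int4 = OVWeightQuantizationConfig(bits=4)

# int4 compression with group size 16 -- the case reported to work with OpenVINO GenAI.
int4_gs16 = OVWeightQuantizationConfig(bits=4, group_size=16)
```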
@echarlaix, @IlyasMoutawwakil, PR is ready for your review. |
@helena-intel I've created ticket 159295 on OV GenAI to examine the empty generation result.
@echarlaix @IlyasMoutawwakil could you please review this PR sometime this week? I'm on vacation starting next week. Thanks!
@IlyasMoutawwakil, @echarlaix, kindly take a look at the PR, as @nikita-savelyevv will be out until the new year.
What does this PR do?
Changes
When exporting an `image-text-to-text` model with optimum-cli in int4, all model components were compressed to int4. However, only the language model should be compressed to int4, while the other components should be compressed to int8_sym. The fix is to make VLM data-free compression run inside the `from_pretrained` call, similar to the data-aware case for LMs.
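As an illustration of the intended behavior, a minimal sketch is shown below. This is an editor's sketch, not code from the PR; the model id and output directory are placeholders.

```python
# Minimal sketch of data-free int4 VLM compression running inside from_pretrained,
# which is what this PR intends. Model id and output path are placeholders.
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

model = OVModelForVisualCausalLM.from_pretrained(
    "openbmb/MiniCPM-V-2_6",  # placeholder model id
    export=True,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
# Expected outcome after this fix: the language model is compressed to int4,
# while the other sub-models (e.g. vision encoder, embeddings) are int8_sym.
model.save_pretrained("MiniCPM-V-2_6-int4")
```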
Tests
Introduced additional checks for low-precision weight nodes of the pipeline sub-models. This should prevent similar issues in the future.
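For readers unfamiliar with such checks, the idea is roughly the following. This is an editor's sketch, not the PR's actual test code, and the sub-model file name is an assumption about the export layout.

```python
# Sketch: count low-precision weight (Constant) nodes in an exported OpenVINO sub-model.
from collections import Counter
import openvino as ov

def count_low_precision_weights(model: ov.Model) -> Counter:
    counts = Counter()
    for op in model.get_ops():
        if op.get_type_name() == "Constant":
            dtype = op.get_element_type().get_type_name()
            if dtype in {"u4", "i4", "u8", "i8"}:
                counts[dtype] += 1
    return counts

core = ov.Core()
# Assumed file name for the language model IR inside the export directory.
lm = core.read_model("MiniCPM-V-2_6-int4/openvino_language_model.xml")
print(count_low_precision_weights(lm))  # expect int4 weights here, int8 in other sub-models
```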
Before submitting